Red Wine Exploration by Hassan Jummah

This dataset is publicly available for research. The details are described in [Cortez et al., 2009].

Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

Univariate Plots Section

##  Number of Rows =  1599 
##  Number of Columns =  12
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1            7.4             0.70        0.00            1.9     0.076
## 2            7.8             0.88        0.00            2.6     0.098
## 3            7.8             0.76        0.04            2.3     0.092
## 4           11.2             0.28        0.56            1.9     0.075
## 5            7.4             0.70        0.00            1.9     0.076
## 6            7.4             0.66        0.00            1.8     0.075
## 7            7.9             0.60        0.06            1.6     0.069
## 8            7.3             0.65        0.00            1.2     0.065
## 9            7.8             0.58        0.02            2.0     0.073
## 10           7.5             0.50        0.36            6.1     0.071
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   11                   34  0.9978 3.51      0.56     9.4
## 2                   25                   67  0.9968 3.20      0.68     9.8
## 3                   15                   54  0.9970 3.26      0.65     9.8
## 4                   17                   60  0.9980 3.16      0.58     9.8
## 5                   11                   34  0.9978 3.51      0.56     9.4
## 6                   13                   40  0.9978 3.51      0.56     9.4
## 7                   15                   59  0.9964 3.30      0.46     9.4
## 8                   15                   21  0.9946 3.39      0.47    10.0
## 9                    9                   18  0.9968 3.36      0.57     9.5
## 10                  17                  102  0.9978 3.35      0.80    10.5
##    quality        rating
## 1        5       Average
## 2        5       Average
## 3        5       Average
## 4        6       Average
## 5        5       Average
## 6        5       Average
## 7        5       Average
## 8        7 Above average
## 9        7 Above average
## 10       5       Average
## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ rating              : Ord.factor w/ 3 levels "Below average"<..: 2 2 2 2 2 2 2 3 3 2 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18  
##            rating    
##  Below average:  63  
##  Average      :1319  
##  Above average: 217  
##                      
##                      
## 

Our dataset consists of 13 variables (the 12 original columns plus the derived rating column) and 1599 observations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

As the pH and density histograms for the red wine quality dataset show, both distributions are symmetric and bell-shaped, which indicates that the data are approximately normally distributed.

Based on the graph of free.sulfur.dioxide, the most common value is approximately 7, which falls on the left side of the distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

More samples in the dataset have sulphates between 0.0 and 1.0 than between 1.1 and 2.0.

The distributions of free.sulfur.dioxide, total.sulfur.dioxide, fixed.acidity and alcohol are right-skewed.
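Right skew can also be checked numerically rather than by eye. The sketch below is an illustration only: `skewness()` is a hypothetical helper (not part of the original analysis), and the input is just the ten total.sulfur.dioxide values printed at the top of this section, not the full dataset. A positive result, together with a mean above the median (46.47 versus 38.00 in the full summary), points to a longer right tail.

```r
# Hypothetical helper: sample skewness as the third standardized moment.
# Positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # population-style standard deviation
  mean((x - m)^3) / s^3
}

# total.sulfur.dioxide for the first ten rows shown in the output above
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

skew_value <- skewness(total_so2)
mean_above_median <- mean(total_so2) > median(total_so2)

print(skew_value)
print(mean_above_median)
```

Ten rows are far too few to characterize the distribution; on the real data the same function would be applied to the whole column.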

The residual sugar histogram is right-skewed, with the most common values between 1 and 3. However, according to the box plot, residual sugar has outliers.

The chlorides histogram is right-skewed, with the most common values between 0.07 and 0.09 (the first and third quartiles in the summary above). However, according to the box plot, chlorides has outliers.
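The outliers flagged by a box plot follow the usual 1.5 × IQR convention. As a rough check against the summary above, residual sugar has quartiles 1.9 and 2.6, so the upper fence is 2.6 + 1.5 × 0.7 = 3.65, and the maximum of 15.5 lies far beyond it. The sketch below applies the same rule in code; `iqr_outliers()` is a hypothetical helper, and the input is only the ten residual.sugar values printed earlier. Note that `quantile()` and a box plot's hinges can differ slightly on small samples.

```r
# Hypothetical helper: return values outside the 1.5 * IQR fences.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# residual.sugar for the first ten rows shown in the output above
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

print(iqr_outliers(residual_sugar))
```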

It seems that the most common quality of the red wine samples falls between 5 and 6, and I would assume that 8 is the best quality based on the physicochemical tests.

The interesting question that could be asked here is: what characteristics are associated with the most common red wine quality?

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations in the red wine quality dataset with 13 attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, the output variable quality, and the derived rating).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality. I would like to know which attributes are associated with different wine quality levels. Do sugar or chlorides affect the quality? I would assume that sugar definitely affects the quality of the red wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Chlorides, sugar, acidity, alcohol percentage and density.

Did you create any new variables from existing variables in the dataset?

I created a new column called ‘rating’. The rating column depends entirely on the values in the quality column, as follows:

- Quality < 5 = ‘Below average’

- 5 ≤ Quality < 7 = ‘Average’

- 7 ≤ Quality ≤ 10 = ‘Above average’

The idea behind the rating column is to group all quality values into 3 categories and analyze the data based on those categories.
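One way to build such a column is with `cut()`. This is a minimal sketch rather than the code used for the report: `rate_quality()` is a hypothetical helper, and the break points are inferred from the summary output, where the counts (63, 1319, 217) imply that qualities 3 and 4 are ‘Below average’, 5 and 6 are ‘Average’, and 7 and 8 are ‘Above average’.

```r
# Hypothetical helper: map integer quality scores to an ordered rating.
# Accepts numeric input or a factor such as the ordered quality column.
rate_quality <- function(quality) {
  q <- as.numeric(as.character(quality))
  cut(
    q,
    breaks = c(-Inf, 4, 6, Inf),  # (-Inf, 4], (4, 6], (6, Inf)
    labels = c("Below average", "Average", "Above average"),
    ordered_result = TRUE
  )
}

quality <- c(3, 4, 5, 6, 7, 8)
rating <- rate_quality(quality)
print(table(quality, rating))
```

Because the intervals are closed on the right, the boundaries 4 and 6 fall in the lower group, which matches the rating values shown for rows with quality 5, 6 and 7 in the printed data.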

Of the features you investigated, were there any unusual distributions?
The citric.acid distribution has an unusual histogram.

Bivariate Plots Section

Fixed acidity in the red wine tends to correlate moderately with pH, density and citric.acid.
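Pairwise correlations like these can be computed with `cor()`. The sketch below is illustrative only: `wine_head` is a hypothetical name for a data frame rebuilt from the ten rows printed in the univariate section, so the coefficients it produces do not represent the full 1599-row dataset.

```r
# First ten rows of four columns, copied from the printed output
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  density       = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978,
                    0.9978, 0.9964, 0.9946, 0.9968, 0.9978),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(wine_head, method = "pearson")
print(round(cor_matrix, 2))
```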

Let's now subset the data and look at the correlations for the worst (3) and best (8) red wine quality in the dataset.

## Warning in cor(x, use = use, method = method): the standard deviation is
## zero

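The above table covers only the subset of the red wine dataset with the maximum quality (8). Because quality is constant within this subset, its standard deviation is zero, which is what the warning reports. From this subset, density appears to be related to quality.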

## Warning in cor(x, use = use, method = method): the standard deviation is
## zero

On the other hand, the above plot shows the correlations among the characteristics of the worst-quality red wine. Alcohol percentage has a strong correlation with pH.

Density has a very strong correlation with Fixed Acidity.
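The "standard deviation is zero" warnings in this section arise whenever `cor()` receives a column that does not vary, which is what happens to quality once the data are subset to a single quality level. The sketch below reproduces the effect under stated assumptions: `q5` is a hypothetical data frame made from the quality 5 rows among the ten printed earlier, with quality stored as a number.

```r
# Rows with quality 5 from the first ten printed rows
q5 <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.4, 9.4, 9.4, 10.5),
  pH      = c(3.51, 3.20, 3.26, 3.51, 3.51, 3.30, 3.35),
  quality = rep(5, 7)
)

# The constant quality column yields NA correlations (and a warning,
# suppressed here so the example runs quietly).
with_constant <- suppressWarnings(cor(q5))

# Dropping zero-variance columns before calling cor() avoids the warning.
has_variance <- vapply(q5, function(col) sd(col) > 0, logical(1))
without_constant <- cor(q5[, has_variance, drop = FALSE])

print(with_constant)
print(without_constant)
```

Filtering out constant columns keeps the correlation table limited to attributes that can actually be compared within the subset.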

We can see from the boxplot above that higher-quality wine has fewer outliers.

## Warning: Removed 21 rows containing non-finite values (stat_boxplot).
## Warning: Removed 21 rows containing missing values (geom_point).

Wouldn’t you assume that sugar always makes things taste better? It may surprise you that in the red wine dataset this is not the case.

## Warning: Removed 41 rows containing non-finite values (stat_boxplot).
## Warning: Removed 41 rows containing missing values (geom_point).

Next, I’ll look at how the categorical features vary with sulphates and quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

- Better wine quality is associated with higher alcohol content

- It is worth mentioning that fixed acidity seems to have almost no effect on quality.

- Better wine tends to have a lower pH

- Interestingly, residual.sugar has no effect on making red wine taste great, although I had assumed that sugar would have a strong relationship with better wine quality
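Relationships like the ones listed above can be summarized by comparing a statistic across quality levels. The following sketch uses `aggregate()` to compute per-quality medians; `wine_head` is a hypothetical data frame holding only the ten rows printed in the univariate section, so the numbers illustrate the technique and are not evidence for the findings.

```r
# First ten rows of three columns, copied from the printed output
wine_head <- data.frame(
  quality        = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
)

# Median alcohol and residual sugar for each quality level
medians <- aggregate(
  cbind(alcohol, residual.sugar) ~ quality,
  data = wine_head,
  FUN = median
)
print(medians)
```

On the full dataset, a median that rises with quality would support an association, while a flat one would be consistent with little or no effect.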

Did you observe any interesting relationships between the other features  (not the main feature(s) of interest)?

What was the strongest relationship you found?

- Fixed acidity and density have a strong correlation in the red wine quality dataset

Multivariate Plots Section

Since we have discovered that alcohol has a strong correlation with quality, we would like to find out whether alcohol is related to other attributes.

Sulphates and alcohol percentage do not appear to affect each other.

It seems that lower volatile.acidity combined with a higher alcohol percentage could contribute to better wine quality.

The plot above clearly shows a strong negative correlation between better quality and density.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

- Quality tends to improve as alcohol increases and volatile acidity decreases

- There is a strong negative correlation between alcohol and density

Were there any interesting or surprising interactions between features?

No, there were no surprises.

OPTIONAL: Did you create any models with your dataset? Discuss the  strengths and limitations of your model.


Final Plots and Summary

Plot One

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Description One

Density values in the red wine quality dataset are approximately normally distributed. Most values are between 0.990 and 1.000, and the most common value is 0.997. The shape of the density histogram is symmetric.
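The `stat_bin()` message above suggests choosing an explicit bin width instead of the default 30 bins. In ggplot2 that means passing `binwidth` to `geom_histogram()`. The sketch below shows the same idea in base R so that it runs without extra packages; the width of 0.0005 is an assumed value chosen for illustration, and `density_head` holds only the ten density values printed in the univariate section.

```r
# density for the first ten rows shown in the output above
density_head <- c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978,
                  0.9978, 0.9964, 0.9946, 0.9968, 0.9978)

binwidth <- 0.0005  # assumed width; tune it to the data

# Equal-width breaks padded by one bin on each side so every value is covered
breaks <- seq(min(density_head) - binwidth,
              max(density_head) + binwidth,
              by = binwidth)

# plot = FALSE returns the bin counts without drawing the histogram
h <- hist(density_head, breaks = breaks, plot = FALSE)
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```

A narrower width shows more detail but noisier bars; a wider one smooths the shape at the cost of hiding local peaks.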

Plot Two

Description Two

Quality 5 red wines have alcohol outliers, and the plot shows that quality 8 wines are strongly associated with higher alcohol content.

Plot Three

Description Three

According to the dataset, better-quality wine tends to have lower density as the alcohol percentage increases.


Reflection

- The dataset does not have enough data to support firm conclusions. Its 1599 records are not enough to make recommendations.

- The dataset does not mention where the samples were collected. I believe the results of this analysis may apply to one geographic location but not to another.

- Assumption is the mother of all mistakes, and this applies to data analysis too. I assumed that sugar would definitely have a great effect on the quality of the wine. However, as shown in the plots above, sugar is not a factor in determining the quality of the red wine.

- Installing R packages on a Mac was an exhausting process. It took me 3 days to fix an issue in my Mac environment that was preventing me from installing packages. I even posted a question on Stack Overflow, and I was finally able to fix the issue. More details are in my post below: